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Abstract 
We design an online algorithm for Principal Component Analysis. In each trial the current instance 
is centered and projected into a probabilistically chosen low dimensional subspace. The regret of 
our online algorithm, that is, the total expected quadratic compression loss of the online algorithm 
minus the total quadratic compression loss of the batch algorithm, is bounded by a term whose 
dependence on the dimension of the instances is only logarithmic. 

We first develop our methodology in the expert setting of online learning by giving an algorithm 
for learning as well as the best subset of experts of a certain size. This algorithm is then lifted to 
the matrix setting where the subsets of experts correspond to subspaces. The algorithm represents 
the uncertainty over the best subspace as a density matrix whose eigenvalues are bounded. The 
running time is O(n) per trial, where n is the dimension of the instances. 

Keywords: principal component analysis, online learning, density matrix, expert setting, quantum 
Bayes rule 


1. Introduction 


In Principal Component Analysis (PCA) the n-dimensional data instances are projected into a k- 
dimensional subspace (k < n) so that the total quadratic compression loss is minimized. After 
centering the data, the problem is equivalent to finding the eigenvectors of the k largest eigenvalues 
of the data covariance matrix. The variance along an eigendirection is always equal to the corre- 
sponding eigenvalue and the subspace defined by the eigenvectors corresponding to the k largest 
eigenvalues is the subspace that captures the largest total variance and this is equivalent to minimiz- 
ing the total quadratic compression loss. 

We develop a probabilistic online version of PCA: in each trial the algorithm chooses a center 
m! and a k-dimensional projection matrix P’~! based on some internal parameter (which summa- 
rizes the information obtained from the previous t — 1 trials); then an instance x’ is received and the 
algorithm incurs compression loss || (x! —m'~!) — P=! (x! — m~!) ||2; finally, the internal parameters 


are updated. The goal is to obtain online algorithms whose total compression loss in all trials is 
T 


close to the total compression loss min 3 || (x — m) — P(x — m)||5 of the batch algorithm which can 
mi i=] 


choose its center and k-dimensional subspace in hindsight based on all T instances. Specifically, 
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in this paper we obtain randomized online algorithms with bounded regret. Here we define regret 
as the difference between the total expected compression loss of the randomized online algorithm 
and the compression loss of the best mean and subspace of rank k chosen offline. In other words 
the regret is essentially the expected additional compression loss incurred by the online algorithm 
compared to normal batch PCA. The expectation is over the internal randomization of the algorithm. 


We begin by developing our online PCA algorithm for the uncentered case, that is, all m = 0 and 
T 


the compression loss of the offline comparator is simplified to min Ł |x — Px ||. In this simpler 
t=1 
case our algorithm is motivated by a related problem in the expert setting of online learning, where 


our goal is to perform as well as the best size k subset of experts. The algorithm maintains a mixture 
vector over the n experts. At the beginning of trial t the algorithm chooses a subset P’~! of k experts 
based on the current mixture vector w’~! that summarizes the previous f — 1 trials. It then receives a 
loss vector Æ € [0, 1]”. Now the subset P~! corresponds to the subspace onto which we “compress” 
or “project” the data. The algorithm incurs no loss on the k components of P’~! and its compression 
loss equals the sum of the remaining n — k components of the loss vector, that is, Vie)... n}—pr-! Ae 
Finally it updates its mixture vector to w’. 

The key insight is to maintain a mixture vector w’‘ as a parameter with the additional constraint 


that wil < a We will show that this “capped” mixture vector represents an implicit mixture 
1-1 


1 


over all subsets of experts of size n — k, and given w'™ we can efficiently sample a subset of 
size n — k from the implicit mixture and choose P’~! as the complementary subset of size k. This 
gives an online algorithm whose total loss over all trials is close to the smallest n — k components 
of the total loss vector £Z; #. We will show how this algorithm generalizes to an online PCA 
algorithm when the mixture vector w‘—! is replaced by a density matrix W‘~! whose eigenvalues are 
capped by a. Now the constrained density matrix W‘~! represents an implicit mixture of (n — k)- 
dimensional subspaces. Again, we can efficiently sample from this mixture, and the complementary 
k-dimensional subspace P’~! is used for projecting the current instance x; at trial t. 

A simple way to construct an online algorithm is to run the offline or batch algorithm on all 
data received so far and use the resulting hypothesis on the next data instance. This is called the 
“Incremental Offline Algorithm” (Azoury and Warmuth, 2001). When the offline algorithm just 
minimizes the loss on the past instances, then this algorithm is also called the “Follow the Leader 
(FL) Algorithm” (Kalai and Vempala, 2005). For uncentered PCA we can easily construct a se- 
quence of instances for which the total online compression loss of FL is ="; times larger than the 
total compression loss of batch PCA. However, in this paper we have a more stringent goal. We 
design randomized online algorithms whose total expected compression loss is at most one times 
the compression loss of batch PCA plus an additional lower order term which we optimize. In other 
words, we are seeking online algorithms with bounded regret. Our regret bounds are worst-case in 
that they hold for arbitrary sequences of instances. 

Simple online algorithms such as the Generalized Hebbian Algorithm (Sanger, 1989) have been 
investigated previously that provably converge to the best offline solution. No worst-case regret 
bounds have been proven for these algorithms. More recently, the online PCA problem was also 
addressed in Crammer (2006). However, that paper does not fully capture the PCA problem because 
the presented algorithm uses a full-rank matrix as its hypothesis in each trial, whereas we use a 
probabilistically chosen projection matrix of the desired rank k. Furthermore, that paper proves 
bounds on the filtering loss, which are typically easier to obtain, and it is not clear how the filtering 
loss relates to the more standard regret bounds for the compression loss proven in this paper. 
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Our algorithm is unique in that we can prove a regret bound for it that is linear in the tar- 
get dimension k of the subspace but logarithmic in the dimension of the instance space. The key 
methodology is to use a density matrix as the parameter and employ the quantum relative entropy as 
a regularizer and measure of progress. This was first done in Tsuda et al. (2005) for a generalization 
of linear regression to the case when the parameter matrix is a density matrix. Our update of the 
density matrix can be seen as a “soft” version of computing the top k eigenvectors and eigenvalues 
of the covariance matrix. It involves matrix logarithms and exponentials which are seemingly more 
complicated than the FL Algorithm which simply picks the top k directions. Actually, the most ex- 
pensive step in both algorithms is to update the eigendecomposition of the covariance matrix after 
each new instance, and this costs O(n”) time (see, e.g., Gu and Eisenstat, 1994). 

The paper is organized as follows. We begin by introducing some basics about batch and online 
PCA (Section 2) as well as the Hedge Algorithm from the expert setting of online learning (Section 
3). We then develop a version of this algorithm that learns as well as the best subset of experts 
of fixed size (Section 4). When lifted to the matrix setting, this algorithm does uncentered PCA 
online (Section 5). Surprisingly, the regret bound for the matrix setting stays the same and this is 
an example of a phenomenon that has been dubbed the “free matrix lunch” (Warmuth, 2007b). We 
briefly discuss the merits of various alternate algorithms in sections 4.1 and 5.1. 

Our online algorithm for centered online PCA is more involved since it has to learn the center 
as well (Section 6). After motivating the updates to the parameters (Section 6.1) we generalize 
our regret bound to the centered case (Section 6.2). We then briefly describe how to construct 
batch PCA algorithms from our online algorithms via standard conversion techniques (Section 6.3). 
Surprisingly, the bounds obtained this way are competitive with the best known batch PCA bounds. 
Lower bounds are discussed in Section 7. A brief experimental evaluation is given in Section 8 and 
we conclude with an overview of online algorithms for matrix parameters and discuss a number of 
open problems (Section 9). 


2. Setup of Batch PCA and Online PCA 


Given a set (or batch) of instance vectors iC eer ou 2 the goal of batch PCA is to find a low- 
dimensional approximation of this data that minimizes the quadratic compression loss. Specifically, 
we want to find a center vector m € R” and a rank k projection matrix! P such that the following 
loss function is minimized: 


[œ — m) — P(x! — m)||3. (1) 


Me 


comp(P,m) = 


Il 
= 


t 


Differentiating and solving for m gives us m* = “x, where “x is the data mean. Substituting this optimal 
center m* into loss (1) we obtain 
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1. Projection matrices are symmetric matrices P with eigenvalues in {0,1}. Note that P? = P. 
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The sum of outer products in the above trace is called the data covariance matrix C. Since I —P is a 
projection matrix, (7 — P)? = I — P, and 


comp(P) =tr(( I—P )C)=tr(C)-tr( P C). 
rank n—k rank k 


We call the above loss the compression loss of P or the loss of subspace I — P. We now give a 
justification for this choice of terminology. Observe that tr(C) equals tr(CP) +tr(C(I — P) ), the sum 
of the losses of the complementary subspaces. However, we project the data into subspace P and 
the projected parts of the data are perfectly reconstructed. We charge the subspace P with the parts 
that are missed, that is, tr((Z — P)C), and therefore call this the compression loss of P. 

We now show that tr(PC) is maximized (or tr((J — P)C) minimized) if P consists of the k 
eigendirections of C with the largest eigenvalues. This proof might seem a digression, but ele- 
ments of it will appear throughout the paper. By rewriting C in terms of its eigendecomposition, 
that is, C= £}; Y; cic} , we can upper bound tr(PC) as follows: 


n n À 
(PC) = a Peel = ic} Po; < max i Oj. 
( ) Lv ( CiC; ) Lv Ci feia 0<8;<1,%; =k Lv 


We can replace the scalars c} Pc; in the ending inequality by the constrained &;’s because of the 
following facts: 


n n 
ci Po; <1, for 1 <i<n, and Yc; Pci = tr(P È cic; ) = tr(P) =k, 
i=l i=l 


I 


since the eigenvectors c; of C are an orthogonal set of n directions. A linear function is maximized 
at one of the vertices of its polytope of feasible solutions. The vertices of this polytope defined by 
the constraints 0 < 6; < 1 and X; ð; = k are those 6 vectors with exactly k ones and n — k zeros. Thus 
the vertices of the polytope correspond to sets of size k and 


k 
mG) ne es Lr 
Clearly the set that gives the maximum upper bound corresponds to the largest k eigenvalues of C 
and tr(P*C) equals the above upper bound when P* consists of the eigenvectors corresponding to 
the set of k largest eigenvalues. 
In the online setting, learning proceeds in trials. At trial t the algorithm chooses a center m7 
and a rank k projection matrix P’~!. It then receives an instance x’ and incurs loss 


1 


Io mt) = Pd nt DB = (= PF E m mT). 


Note that this is the compression loss of the center m~! and subspace P“! on the instance x’. Our 
goal is to obtain an algorithm whose total online compression loss over the entire sequence of T 
trials ZZ tr((Z— P’~!) (x! — m'~!)(x¢ —m'~!)") is close to the total compression loss (1) of the best 
center m“ and best rank k projection matrix P* chosen in hindsight by the batch algorithm. 
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3. Learning as Well as the Best Expert with the Hedge Algorithm 


The following setup and algorithm will be the basis of this paper. The algorithm maintains a prob- 
ability distribution w’~! over n experts. At the beginning of trial t it chooses an expert probabilis- 
tically according to the probability vector w’—!, that is, expert i is chosen with probability wit, 
Then a loss vector # € [0,1]” is received, where £ specifies the loss of expert i incurred in trial t. 
The expected loss of the algorithm will be w’~! - ’, since the expert was chosen probabilistically. 
At the end of the trial, the probability distribution is updated to w’ using exponential update factors 


(See Algorithm 1). This is essentially the Hedge Algorithm of Freund and Schapire (1997). In the 


Algorithm 1 Hedge Algorithm 


input: Initial n-dimensional probability vector w? 
fort = 1 to T do 

Draw an expert i with probability wi~ 

Receive loss vector ¢’ 

Incur loss £& 

and expected loss w‘—! - # 
7a a 
j=1Wj p(=n6,) 

end for 





1 





original version the algorithm proposes a distribution w'™! at trial ż and incurs loss w’! - Æ (instead 
of drawing an expert from w‘—! and incurring expected loss w’~! - 4). 

It is easy to prove the following bound on the total expected loss. Here d(u,w) denotes the 
relative entropy between two probability vectors d(u,w) = X}; ujlog F and log is the natural log- 
arithm. 


Theorem 1 For an arbitrary sequence of loss vectors t',...,€7 € (0,1]", the total expected loss of 
Algorithm 1 is bounded as follows: 


T 
yw! Ø < 
t=1 


for any learning rate y > 0 and comparison vector u in the n dimensional probability simplex. 


NE u: l + (d(u,w?) -d(u,w")) 
1 —exp(—n) 





’ 


Proof The update for w’ in Algorithm 1 is essentially the update of the Continuous Weighted 
Majority Algorithm where the absolute loss of expert i is replaced by ¢). Since Æ € [0,1], we 
have exp(—n@;) < 1— (1 —exp(—n))@ and this implies (essentially Littlestone and Warmuth 1994, 
Lemma 5.2, or Freund and Schapire 1997): 





n 
-log $ wi! exp(—né4) > —log(1— (1 — exp(—n))w'! -€)) > ww"! (1 —exp(-n)). 
i=1 
The above can be reexpressed with relative entropies as follows (Kivinen and Warmuth, 1999): 


-nul —log} wi! exp(—nf) 
i=1 


-nu-l +w'!-e(1 —exp(—n)). (2) 


d(u,w'~') —d(u,w’) 


IV 
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The bound of theorem can now be obtained by summing over trials. | 


The original Weighted Majority algorithms were described for the absolute loss (Littlestone and 
Warmuth, 1994). The idea of using loss vectors instead was introduced in Freund and Schapire 
(1997). The latter paper also shows that when £, u: Č < L and d(u,w®) — d(u,wT) < D < logn, 
then with n = log(1 + \/2D/L), we get the bound 


Ywl- < Yul +V2LD+d(u,wo)—d(u,w"). (3) 
t t 


By setting u to be the vector with a single one identifying the best expert, we get the following 
bound on the regret of the algorithm (Again log denotes the natural logarithm.): 





total loss of alg. — total loss of best expert < yV 2 (total loss of best expert) logn + logn. 


4. Learning as Well as the Best Subset of Experts 


Recall that projection matrices are symmetric positive definite matrices with eigenvalues in {0,1}. 
Thus a rank k projection matrix can be written as P = ES Pi pi, where the p; are the k orthonor- 
mal vectors forming the basis of the subspace. Assume for the moment that the eigenvectors are 
restricted to be standard basis vectors. Now a projection matrix becomes a diagonal matrix with 
k ones in the diagonal and n — k zeros. Also, the trace of a product of such a diagonal projection 
matrix and any symmetric matrix specifying the loss becomes a dot product between the diagonals 
of both matrices. The diagonal of the symmetric matrix may be seen as a loss vector ¢’. Thus, in 
this simplified diagonal setting, our goal is to develop online algorithms whose total loss is close to 
the sum of the lowest n — k components of total loss vector yy Č. Equivalently, we want to find 
the highest k components of the total loss vector and per our nomenclature the loss of the lowest 
n — k components is the compression loss of the complementary highest k components. 

For this problem, we will encode the subsets of size n — k as probability vectors: we call r € 
(0, 1)” an (n—k)-corner if it has n —k components fixed to -+ and the remaining k components fixed 
to zero. The algorithm maintains a probability vector w‘ as its parameter. At trial t it probabilistically 
chooses an (n — k)-corner r based on the current probability vector w’! (Details of how this is done 
will be given shortly). The set of k components missed by r is the set P’~! that we compress with 
at trial £. The algorithm then receives a loss vector ¢’ and incurs compression loss (n — k) r- # = 
LYie{1,....n}-p! G. Finally the weight vector w'—! is updated to w’. 

We now describe how the corner is chosen: The current probability vector is decomposed into a 
mixture of n corners and then one of the n corners is chosen probabilistically based on the mixture 
coefficients. In the description of the decomposition algorithm we use d = n—k for convenience. Let 
A’, denote the convex hull of the (") corners of size d (where 1 < d <n). Clearly, any component 
wji of a vector w in the convex hull is at most 4 because it is a convex combination of numbers 
in {0, i}. Therefore A’, C B}, where BY, is the capped probability simplex, that is, the set of n- 
dimensional vectors w for which |w| = Y;w; = 1 and 0 < w; < is for all i. Figure 1 depicts the 
capped probability simplex for case d = 2 and n = 3,4. The following theorem shows that the 
convex hull of the corners is exactly the capped probability simplex, that is, A% = BY. It shows this 
by expressing any probability vector in the capped simplex B% as a convex combination of at most 
n d—corners. For example, when d = 2 and n = 4, B% is an octahedron (which has 6 vertices). 
However, each point in this octahedron is contained in a tetrahedron which is the hull of only 4 of 
the 6 total vertices. 


2292 


ONLINE PCA 










E 2,0,2,0) 


(%,0,%) (0,%,%) 


(2,2,0) 


Figure 1: The capped probability simplex B”, for d = 2 and n = 3,4. This simplex is the intersection 
of n halfspaces (one per capped dimension) and its vertices are the (5) d-corners. 


1/3 of total weight 




















Figure 2: A step of the Mixture Decomposition Algorithm 2, n = 6 and k = 3. When a corner is 
removed, then at least one more component is set to zero or raised to a d-th fraction of the 
total weight. The left picture shows the case where a component inside the corner gets set 
to zero and the right one depicts the case where a component outside the picked corner 
gets d-th fraction of the total weight. 


Theorem 2 Algorithm 2 decomposes any probability vector w in the capped probability simplex B", 
into a convex combination? of at most n d-corners. 


Proof Let b(w) be the number of boundary components in w, that is, b(w) = |{i : w; is 0 or wiy, 
Let Bn be all vectors w such that 0 < w; < w, for all i. If b(w) = n, then w is either a corner or 
0. The loop stops when w = 0. If w is a corner then it takes one more iteration to arrive at 0. We 
show that if w € B? and w is neither a corner nor 0, then the successor W lies in B’ and b(W) > b(w). 
Clearly, W > 0, because the amount that is subtracted in the d components of the corner is at most as 
large as the corresponding components of w. We next show that W; < i If i belongs to the corner 





that was chosen then Ŵ; = w; — 4 < wig = i, Otherwise W; = w; < l, and I < m follows from 
the fact that p < |w| — d l. This proves that w € B}. 





2. The existence of a convex combination of at most n corners is implied by Carathéodory’s theorem (Rockafellar, 
1970), but Algorithm 2 gives an effective construction. 
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Algorithm 2 Mixture Decomposition 
input 1 < d < n and w € B} 
repeat 
Let r be a corner for a subset of d non-zero components of w 
that includes all components of w equal to wi 
Let s be the smallest of the d chosen components of r 
and / be the largest value of the remaining n — d components 


w := w — min(d s, |w| — d l) r and output pr 
— 





P 
until w = 0 





For showing that b(W) > b(w) first observe that all boundary components in w remain boundary 
ania 
=" = 


components in W: zeros stay zeros and if w; = bl then i is included in the corner and W; 
mi, However, the number of boundary components is increased at least by one because the com- 
ponents corresponding to s and / are both non-boundary components in w and at least one of them 
becomes a boundary point in W: if p = d s then the component corresponding to s in w is s— 4 = 0 
in W, and if p = |w| — d l then the component corresponding to / in w is l = Wp = ue It follows 
that it may take up to n iterations to arrive at a corner which has n boundary components and one 
more iteration to arrive at 0. Finally note that there is no weight vector w € Bn s.t. btw) =n-1 
and therefore the size of the produced linear combination is at most n. More precisely, the size is at 
most n — b(w) ifn — b(w) < n—2 and one if w is a corner. 

The algorithm produces a linear combination of (n — k)-corners, that is, w = £; pjrj. Since 
pj 2 Oand all |r;| = 1, j Pj = 1 and we actually have a convex combination. a 


It is easy to implement the Mixture Decomposition Algorithm in O(n?) time: simply sort w and 
spend O(n) per loop. 
The batch algorithm for the set problem simply picks the best set in a greedy fashion. 


Fact 1 For any loss vector £, the following corner has the smallest loss of any convex combination 
of corners in A’, = B’,; Greedily pick the component of minimum loss (d times). 


How can we use the above mixture decomposition and fact to construct an online algorithm? 
It seems too hard to maintain information about all ( us P) corners of size n — k. However, the best 
corner is also the best convex combination of corners, that is, the best from the set A”?_, where each 
member of this set is given by ( Bis 5) coefficients. Luckily, this set of convex combinations equals 
the capped probability simplex B”_, and it takes only n coefficients to specify a member in B”_,. 
Therefore we can maintain a parameter vector in BY, and for any such capped vector w, Algorithm 
2 decomposes it into a convex combination of at most n many (n —k)-corners. This means that 
any algorithm producing a hypothesis vector in B”_, can be converted to an efficient algorithm that 
probabilistically chooses an (n — k)-corner. 

Algorithm 3 spells out the details for this approach. The algorithm chooses a corner probabilis- 
tically and (n — k) w7! - Æ is the expected loss at trial t. After updating the weight vector w’—! by 
multiplying with the factors exp(—n&) and renormalizing, the resulting weight vector W might lie 
outside of the capped probability simplex BY _,. We then use a Bregman projection with the relative 
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Algorithm 3 Capped Hedge Algorithm 

input: 1 < k < n and an initial probability vector w? € B? 

fort = 1 to T do 
Decompose w‘! into a convex combination Y jPjrj of at most n corners rj 

by applying Algorithm 2 with d = n — k 

Draw a corner r = r j with probability p; 
Let P’—! be the k components outside of the drawn corner r 
Receive loss vector 4 
Incur compression loss (n—k)r-€ = Xie{1,.. npe- G 





and expected compression loss (n — k) w'! - ¢' 

wit exp(—n&) 

Ei exp(—né}) 

w = cap,_,(’) where cap,,_;(.) invokes Algorithm 4 





Update: wi = 


end for 





Algorithm 4 Capping Algorithm 
input probability vector w, set size d 
Let w! index the vector in decreasing order, that is, wi = max(w) 
if max(w) < 4 then 
return w 
end if 
i=l 
repeat 
(* Set first i largest components to i and normalize the rest to di *) 





w=w 
w= 4, for jal... 


al 
oe ae; Ww. 
wy, = i n : Paka 
, , Lain W7 
i:=i+ 1 
. ~ 1 
until max(w) <j 
return w 





for j=i+1...n 





entropy as the divergence to project the intermediate vector W back into B? y 


: At 
w' = argmin d(w,w’). 
weBr y 


This projection can be achieved as follows (Herbster and Warmuth, 2001): find the smallest i s.t. 
capping the largest i components to -A and rescaling the remaining n — i weights to total weight 
1— Ea makes none of the rescaled weights go above E The simplest algorithm starts with sorting 
the weights and then searches for i (see Algorithm 4). However, a linear time algorithm is given in 
Herbster and Warmuth (2001)? that recursively uses the median. 





3. The linear time algorithm of Figure 3 of that paper bounds the weights from below. It is easy to adapt this algorithm 
to the case of bounding the weights from above (as needed here). 
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When k = n—1 and d = n— k = 1, B} is the entire probability simplex. In this case the call 
to Algorithm 2 and the projection onto BY are vacuous and we get the standard Hedge Algorithm 
(Algorithm 1) as a degenerate case. Note that (n —k)Y?_,u-¢' is the total compression loss of 
comparator vector u. When u is an (n —k)-corner, that is, the uniform distribution on a set of size 
n—k, then (n—k) £7; u- č is the total loss of this set. 


Theorem 3 For an arbitrary sequence of loss vectors €',...,¢' € (0,1]", the total expected com- 
pression loss of Algorithm 3 is bounded as follows: 





wich. pc MK) EL wl + (nk) (du?) —d(u,w")) 
7 1 —exp(—n) 


Mns 


(n—k) 


’ 


~ 
Il 


for any learning rate n > 0 and comparison vector u € B? _,. 


Proof The update for W in Algorithm 3 is the same as update for w’ in Algorithm 1. Therefore we 
can use inequality (2): 


d(u,w'') —d(u,w') > —nu-@+w'!-@(1—exp(—n)). 


Since the relative entropy is a Bregman divergence (Bregman, 1967; Censor and Lent, 1981), the 
weight vector w‘ is a Bregman projection of vector W onto the convex set BY _,- For such projections 
the Generalized Pythagorean Theorem holds (see, e.g., Herbster and Warmuth, 2001, for details): 


d(u,w') > d(u,w')+d(w', iw). 


Since Bregman divergences are non-negative, we can drop the d(w’,#’) term and get the following 
inequality: 
d(u, W) —d(u,w’) > 0, for u € B}. 


Adding this to the previous inequality we get: 
d(u,w'!) —d(u,w') > —nu-@ +w"! -#(1— exp(—1)). 


By summing over t, multiplying by n — k, and dividing by 1 — exp(—n), the bound follows. E 


It is easy to see that (n — k)(d(u,w?) —d(u,w" )) < (n—k) log “4 and this is bounded by klog ? 
when k < n/2. By tuning 17 as in (3), we get the following regret bound: 


(expected total compression loss of alg.) - (total compression loss of best k-subset) 


k<n/2 n n 
< y 2(total compression loss of best k-subset) k log k +klog T (4) 





The last inequality follows from the fact that (n — k) log -4z < klog ġ when k < n/2. Note that the 
dependence on k in the last regret bound is essentially linear and dependence on n is logarithmic. 
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4.1 Alternate Algorithms for Learning as Well as the Best Subset 


The question is whether projections onto the capped probability simplex are really needed. We 
could simply have one expert for each set of n — k components and run Hedge on the (i set 
experts, where the loss of a set expert is always the sum of the n — k component losses. The 
set expert {i1,...,i,-,} receives weight proportional to exp(— ne, es = 1m exp(—f"), where 
i = ae , 2). These product weights can be maintained implicitly: keep one weight per component 
where the ith component receives weight exp(—5"), and use dynamic programming for summing 
the produced weights over the ess A sets and for choosing a random set expert based on the product 
weights. See, for example, Takimoto and Warmuth (2003) for this type of method. While this dy- 
namic programming algorithm can be made reasonably efficient (O(n? (n — k)) per trial), the range 
of the losses of the set experts is now [0,n—k] and this introduces factors of n — k into the tuned 


regret bound: 





y 2(total compression loss of best k-subset) (n — k) klog = + (n —k) klog 7 (5) 


Curiously enough our new capping trick avoids these additional factors in the regret bound by 
using only the original n experts whose loss is in [0,1]. We do not know whether the improved 
regret bound (4) (i.e., no additional n — k factors) also holds for the sketched dynamic programming 
algorithm. However, the following example shows that the two algorithms produce qualitatively 
different distributions on the sets. 

Assume n = 3 and k = 1 and the update factors exp(—1é5") for experts 1, 2 and 3 are propor- 
tional to 1, 2, and 4, respectively, which results in the normalized weight vector (5, Z, +). Capping 
the weights at A = 5 with Algorithm 4 produces the following vector which is then decomposed 
via Algorithm 2: 

111 1,11 2,.1 1 
EFD = 3 (51%) + 301555). 
SS Se 

set {1,3} set {2,3} 





(6) 


On the other hand the product weights exp(—n@;*") x exp(—n¢;") of the dynamic programming 
algorithm for the three sets {1,2}, {1,3} and {2,3} of size 2 are 1 x2, 1 *4, and 2 «4, respectively. 
That is, the dynamic programming algorithm gives (normalized) probability L, 2 and 4 to the three 
sets. Notice that Capped Hedge gives expert 3 probability 1 (since it is included in all corners of the 
decomposition (6)) and the dynamic programming algorithm gives expert 3 probability 3 the total 
probability it has assigned to the two sets {1,3} and {2,3} that contain expert 3. 

A second alternate is the Follow the Perturbed Leader (FPL) Algorithm (Kalai and Vempala, 
2005). This algorithm adds random perturbations to the losses of the individual experts and then 
selects the set of minimum perturbed loss as its hypothesis. The algorithm is very efficient since it 
only has to find the set with minimum perturbed loss. However its regret bound has additional fac- 
tors in addition to the n — k factors appearing in the above bound (5) for the dynamic programming 
algorithm. For the original Randomized Hedge setting with just n experts (Section 3), a distribu- 
tion of perturbations was found for which FPL simulates the Hedge exactly (Kalai, 2005; Kuzmin 
and Warmuth, 2005) and therefore the additional factors can be avoided. However we don’t know 
whether there is a distribution of additive perturbations for which FPL simulates Hedge with set 
experts. 
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5. Uncentered Online PCA 


We create an online PCA algorithm by lifting our new algorithm for sets of experts based on capped 
weight vector to the matrix case. Now matrix corners are density matricest with d eigenvalues 
equal to l and the rest are 0. Such matrix corners are just rank d projection matrices scaled by 
i: (Notice that the number of matrix corners is uncountably infinite.) We define the set -4% as the 
convex hull of all matrix corners. The maximum eigenvalue of a convex combination of symmetric 
matrices is at most as large as the maximum eigenvalue of any of the individual matrices (see, e.g., 
Bhatia, 1997, Corollary III.2.2). Therefore each convex combination of corners is a density matrix 
whose eigenvalues are bounded by i and 4% C B’, where B% consists of all density matrices whose 
maximum eigenvalue is at most i. Assume we have some density matrix W € B% with eigendecom- 


position W diag(@) W T. Algorithm 2 can be applied to the vector of eigenvalues © of this density 
matrix. The algorithm decomposes @ into at most n diagonal corners rj: œ = }; pjr;. This convex 
combination can be turned into a convex combination of matrix corners that decomposes the density 
matrix: W =), p; Wdiag(rj)W T, Tt follows that A’, = Bh, as in the diagonal case. 

As discussed before, losses can always be viewed in two different ways: the loss of the al- 
gorithm at trial ¢ is the compression loss of the chosen projection matrix P’—! or the loss of the 
complementary subspace J — P’~!, that is, 


Pog x 2 _ tf Į— p! x xt T: : 
le IIs =t(( Jx œ) ) 


rank k rank n—k 


Our online PCA Algorithm 5 has uncertainty about which subspace of rank n — k is best and it rep- 
resents this uncertainty by a density matrix W‘! € 4” _,, that is, a mixture of (n — k)-dimensional 
matrix corners. The algorithm efficiently samples a subspace of rank n — k from this mixture and 
uses the complementary subspace P’~! of rank k for compression. The expected compression loss 
of algorithm will be (n —k)tr(W’!xx'). 

The following lemma shows how to pick the best matrix corner. When S = Z2; x(x')', then 
this lemma justifies the choice of the batch PCA algorithm. 


Theorem 4 For any symmetric matrix S, minweg» (WS) attains its minimum at the matrix corner 
formed by choosing d orthogonal eigenvectors of S of minimum eigenvalue. 


Proof Let A!(W) denote the vector of eigenvalues of W in descending order and let 4'(S) be the 
same vector of S but in ascending order. Since both matrices are symmetric, tr(WS) > A!(W) - 
(8) (Marshall and Olkin 1979, Fact H.1.h of Chapter 9, we will sketch a proof below). Since 
Al (W) € B", the dot product is minimized and the inequality is tight when W is a d-corner (on 
the n-dimensional probability simplex) corresponding to the d smallest eigenvalues of S. Also the 
greedy algorithm finds the solution (see Fact 1 of this paper). 

For the sake of completeness, we will sketch a proof of the inequality tr(WS) > A!(W) -A1 (S). 
We begin by rewriting the trace using an eigendecomposition of both matrices: 


tr(WS) = tr( owiw, ojsjsl) =) O;Oj (wi sjy. 
:=M;i, j 





4. Density matrix is a symmetric positive definite matrix of trace 1, that is, they are symmetric matrices whose eigen- 
values form a probability vector 
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The matrix M is doubly stochastic, that is, its entries are nonnegative and its rows and columns sum 
to 1. By Birkhoff’s Theorem (see, e.g., Bhatia, 1997), such matrices are the convex combinations of 
permutations matrices (matrices with a single one in each row and column). Therefore the minimum 
of this linear function occurs at a permutation, and by a swapping argument one can show that the 
permutation which minimizes the linear function is the one that matches the ith smallest eigenvalue 
of W with the (n — i)th largest eigenvalue of S. a 


We obtain our algorithm for online PCA (Algorithm 5) by lifting Algorithm 3 for set experts to 
the matrix setting. The exponential factors used in the updates of the expert setting are replaced by 
the corresponding matrix version which employs the matrix exponential and matrix logarithm (War- 
muth and Kuzmin, 2006a).> For any symmetric matrix A with eigendecomposition £; O;a;a; , the 
matrix exponential exp(A) is defined as the symmetric matrix Y"_, exp(0;)a;a; . Observe that the 
matrix exponential exp(A) (and analogously the matrix logarithm log(A) for symmetric positive 
definite A) affects only the eigenvalues and not the eigenvectors of A. 

The following theorem shows that for the Bregman projection we can keep the eigensystem 
fixed. Here the quantum relative entropy A(U,W) = tr(U(logU — logW)) is used as the Bregman 
divergence. 


Theorem 5 Projecting a density matrix onto B% w.r.t. the quantum relative entropy is equivalent to 

projecting the vector of eigenvalues w.rt. the “normal” relative entropy: If W has the eigendecom- 
a T 

position W diag(@)W , then 


argmin A(U,W) = Wu" W! , where u* = argmin d(u,œ). 
UEBI ucBy 
Proof The quantum relative entropy can be rewritten as follows: 


A(U,W) = t(U logU) — t(U log W) = A(U) - log(A(U)) — t(U log W), 


where A(U) denotes the vector of eigenvalues of U and log is the componentwise logarithm of a 
vector. For any symmetric matrices S and T, tr(ST) < A! (S)-A!(T) (Marshall and Olkin 1979, Fact 
H.1.g of Chapter 9; also see proof sketch of a similar fact given in previous theorem). This implies 
that 


A(U,W) > MU) -log(A(U)) = X (U) -At (log(W)) = MU) -log(A(U)) -X (U) - log a!(W). 


Therefore min A(U,W) > min d(u,@), and if u* minimizes the r.h.s. then W diag(u*)W! mini- 
UEB ucB”, 
mizes the 1.h.s. because A( W diag(u*)W,W) = d(u*, œ). a 


The lemma means that the projection of a density matrix onto B)_, is achieved by applying Algo- 
rithm 4 to the vector of eigenvalues of the density matrix. 

We are now ready to prove a worst-case loss bound for Algorithm 5 for the uncentered case of 
online PCA. Note that the expected loss in trial ¢ of this algorithm is (n — k)tr(W‘!x'(x‘)'). When 
U is a matrix corner then (n — k) Z7; tr(Ux (xt) |) is the total loss of the corresponding subspace. 





5. This update step is a special case of the Matrix Exponentiated Gradient update for the the linear loss tr(Wx! (x)! ) 
(Tsuda et al., 2005). 
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Algorithm 5 Uncentered online PCA algorithm 





input: 1 < k < n and an initial density matrix W? € BY_, 
fort = 1 to T do 
Perform eigendecomposition W! = Waw! 
Decompose @ into a convex combination )) ; pjr; of at most n corners rj 
by applying Algorithm 2 with d = n — k 
Draw a corner r = r; with probability p; 
Form a matrix corner R = W diag(r) W" 
Form a rank k projection matrix P'™! = I — (n — k)R 
Receive data instance vector x" 
Incur compression loss ||x’ — P 1x |3 = tr((I — Pt) x (x)! ) 
and expected compression loss (n — k)tr(W txt (x) ') 
exp(log W7—! -nx (x£)! ) 
tr(exp(log W"! -nx (x')")) 
W= cap,_,(W’), 
where cap,,_;(A) applies Algorithm 4 to the vector of eigenvalues of A 


Update: W = 





end for 





Theorem 6 For an arbitrary sequence of data instances x! ,... x" of 2-norm at most one, the total 


expected compression loss of the algorithm is bounded as follows: 


(n—k)te(WI lx’ (x!) ") 


Mr 


1 


Il 


n(n =k) Er (Ux (x)' ) + (n= k) (AU, W?) — AU, W7)) 


< 
7 1 —exp(—n) 





’ 
for any learning rate ņ > 0 and comparator density matrix U € B_y. 


Proof The update for W isa density matrix version of the Hedge update which was used for 
variance minimization along a single direction (i.e., k = n — 1) in Warmuth and Kuzmin (2006a). 
The basic inequality (2) for that update becomes: 


A(U,W"!) —A(U,W’) > —ntr(Ux (x) ") +te(W tx (x) (1 —exp(—n)). 


As in the proof of Theorem 3 of this paper, the Generalized Pythagorean Theorem applies and 
dropping one term we get the following inequality: 


A(U,W') —A(U,W’) > 0, for U € B"_,. 
Adding this to the previous inequality we get: 
A(U,W'"!) —A(U,W’) > —ntr(Ux! (x7) 1) + tr(Wt tx (x7) ) — exp(—n)). 
By summing over t, multiplying by n — k, and dividing by 1 — exp(—7n), the bound follows. | 
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It is easy to see that (n—k)(A(U,W°) —A(U,W*)) < (n—k) log 4 and this is bounded by klog # 
when k < n/2. By tuning 1 as in (3), we can get regret bounds of the form: 


(expected total compression loss of alg.) - (total compression loss of best k-subspace) 


k<n/2 n n 
< y 2(total compression loss of best k-subspace) klog k +klog p (7) 





Let us complete this section by discussing the minimal assumptions on the loss functions needed 
for proving the regret bounds obtained so far. Recall that in the regret bounds for experts as well 
as set experts we always assumed that the loss vector ¢' received at trial ¢ lies in [0, 1]”. In the case 
of uncentered PCA, the loss at trial t is specified by an instance vector x’ that has 2-norm at most 
one. In other words, the single eigenvalue of the instance matrix x’(x’)' must be bounded by 1. 
However, it is easy to see that the regret bound of the previous theorem still holds if at trial ¢ the 
instance matrix x(x)! is replaced by any symmetric instance matrix S’ whose vector of eigenvalues 
lies in [0, 1]”. 


5.1 Alternate Algorithms for Uncentered Online PCA 


We conjecture that the following algorithm has the regret bound (7) as well: run the dynamic pro- 
gramming algorithm for the set experts sketched in Section 4 on the vector of eigenvalues of the 
current covariance matrix. The produced set for size k is converted to a projection matrix of rank 
k by replacing it with the k outer products of the corresponding eigenvectors. We are not elaborat- 
ing on this approach since the algorithm inherits the additional n — k factors contained in the regret 
bound (5) for set experts. If these factors in the regret bound for set experts can be eliminated then 
this approach might lead to a competitive algorithm. 

Versions of FPL might also be used to design an online PCA algorithm for compressing with a 
k dimensional subspace. Such an algorithm would be particularly useful if the same regret bound 
(7) could be proven for it as for our online PCA algorithm. The question is whether there exists 
a distribution of additive perturbations of the covariance matrix for which the loss of the subspace 
formed by the eigenvectors of the n — k smallest eigenvalues simulates a matrix version of Hedge 
on subspaces of rank n — k and whether this algorithm does not have the n — k factors in its bound. 
Note that extracting the subspace formed by the eigenvectors of the n — k smallest (or k largest) 
eigenvalues might be more efficient than performing a full eigendecomposition. 


6. Centered Online PCA 


In this section we extend our online PCA algorithm to also estimate the data center online. Under 
the extended protocol, the algorithm needs to produce both a rank k projection matrix P’~! and a 
data center m~! at trial t. It then receives a data point x’ and incurs compression loss ||(x’ —m‘~!) — 
P'~!(x! —m'—!)||5. As for uncentered online PCA, we will use a capped density matrix W‘~! to 
represent the algorithm’s uncertainty about the hidden subspace. 


6.1 Motivation of the Updates 


We begin by motivating the updates of all the algorithms analyzed so far. We follow Kivinen and 
Warmuth (1997) and motivate the updates by minimizing a tradeoff between a parameter divergence 
and a loss function. Here we also have the linear capping constraints. Since our loss is linear, the 


2301 


WARMUTH AND KUZMIN 


tradeoff minimization problem can be solved exactly instead of using approximations as is done in 
Kivinen and Warmuth (1997) for non-linear losses. Updates motivated by exact solution of tradeoff 
minimization problems involving non-linear loss functions are sometimes called implicit updates 
since they typically do not have a closed form (Kivinen et al., 2005). Even though the loss function 
used here is linear, the additional capping constraints are responsible for the fact that there is again 
no closed form for the updates. Nevertheless our algorithms are always able to compute the optimal 
solutions of the tradeoff minimization problems defining the updates. 

We begin our discussion of motivations of updates with the set expert case. Consider the fol- 
lowing two updates: 


w = arginf,<pr , (m'd(w,w™!) +w- L), (8) 
t 

w = arginf cpr, (muwa Ewer) (9) 
q=1 


In the motivations of all our updates, the divergences are always versions of relative entropies which 
are special cases of Bregman divergences. Here d denotes the standard relative entropy between 
probability vectors. The first update above trades off the divergence to the last parameter vector 
with the loss in the last trial. The second update trades off the divergence to the initial parameter 
with the total loss in all past trials. In both cases the minimization is over B”_, which as we recall 
is the n-dimensional probability simplex with the components capped at a. One can show that 
the combined two update steps of the Capped Hedge Algorithm 3 coincide with the first update (8) 
above. The solution to (8) has the following exponential form: 


t—-1 


tw texp (nel +1) 
Ei wh exp nE +Y) 


l 

where ¥; is the Lagrangian coefficient that enforces the cap on the weight wi. The non-negativity 
constraints don’t have to be explicitly enforced because the relative entropy is undefined on vectors 
with negative elements and thus acts as a barrier function. Because of the capping constraints, the 
two updates (8) and (9) given above are typically not the same. However when k = n — 1, then 
B"_, = B} is the entire probability simplex and the y; coefficients disappear. In that case both 
updates agree and motivate the update of vanilla Hedge (Algorithm 1) (See Kivinen and Warmuth, 
1999). 

Furthermore, the above update (8) can be split into two steps as is done in Algorithm 3: the first 
update step uses exponential factors to update the probability vector and the second step performs a 
relative entropy projection of the intermediate vector onto the capped probability simplex. Here we 
give the sequence of two optimization problems that motivate the two update steps of Algorithm 3: 





w= arging,,.>0, £w;=1 (n7'd(w,w'!) +w- L) ; 
w; = arginfpeg" d(w,®). 


For the motivation of the uncentered online PCA update (Algorithm 5), we replace the relative 
entropy d(w,w’~!) between probability vectors in (8) by the Quantum Relative Entropy A(W,W*7!) = 
tr(W (log W —logW‘~!)) between density matrices. Furthermore, we change the loss function from 
a dot product to a trace: 


wW = arginfycge , (naw!) + tr(W2" (x")")) 
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Recall that By, is the set of all n x n density matrices whose maximum eigenvalue is at most a. 
Note that in Algorithm 5, this update is again split into two steps. 

The case of centered online PCA, which we will address now, is the most interesting because 
now we have two parameters. We use the following update which uses a divergence to the initial 
parameters (as in (9)): 


(W',m') = — arginf (n-1A(W,W°) +a) (n-m?) Wm —m) J 
WEB” p, mER” 
+ >. tr(W (x4 —m)(x4—m)"). (10) 
q=1 


Notice that we have two learning rates: n for the density matrix parameter and 1 for the center 
parameter. The above update may be viewed as a maximum a posteriori estimator since the diver- 
gences act as priors or initial examples and the inverse learning rates that multiply the divergences 
determine the importance of the priors (See, e.g., Azoury and Warmuth, 2001, for a discussion). 
When n~! = 77! = 0, then there are no priors and the update become the Maximum Likelihood 
estimator or Follow the Leader (FL) Algorithm. If 77! — œ, then m‘ is clamped to the fixed center 
m°. If further m? = 0, then the above motivation becomes a motivation for an uncentered update 
with a divergence to the initial density matrix W? (analogous to (9)). Similarly, when n~! — ©, 
then W‘ is clamped to the fixed density matrix W° and the resulting optimization problem motivates 
the Incremental Off-line Algorithm for Gaussian density estimation with a fixed covariance matrix 
(Azoury and Warmuth, 2001). 

As in Kuzmin and Warmuth (2007), we analyze this update for centered PCA by rewriting its 
optimization problem (10) as the dual maximization problem. The constraint W € B%_; in equa- 
tion (10) is equivalent to having constraints tr(W) = 1 and W x -HI . The constraint W = 0 is 
automatically enforced since the quantum relative entropy acts as a barrier. With this in mind, we 
write down the Lagrangian function, where U’(W,m) is the objective function of our optimization 
problem (10) that includes data points from ż trials, 6 is the dual variable for the trace constraint and 


the symmetric positive definite matrix I is the dual variable for the capping constraint: 
1 
L'(W,m,T,8) = U'(W,m) + 8(tr(W) — 1) +r ((W — ph) 
n — 
The optimization over m is unconstrained, giving the solution for m‘: 


a~l 0 t q 
N m + =e 








= 11 

m =r (11) 

This is essentially the normal mean of an extended sample, where we added 1! copies of m? to 

x!,...,x1. To write down the form of the solution for Wt compactly we will introduce the following 
matrix: 

t 
C =! (m —m’)(m® — mm!) +Y (x4 —m') (x4 —m')". (12) 
q=1 


This can be seen as the extended sample covariance matrix where we added )~! 


mo. 


copies of instance 
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Setting the derivatives to zero and solving (see Tsuda et al. 2005 for similar derivation), we 
obtain the following form of W’ in terms of the dual variables 6! = nò and T: 


wW (èT) =exp(logW° — nC’ — 8'7—n1). 


The constraint tr(W) = 1 is enforced by choosing 8’ = logtr(exp(log W; — nC’ —T)). By substitut- 
ing W‘(6’,I) and the formula for m’ into the Lagrangian L’ and simplifying, we obtain the following 
dual problem: 


Ti T! (I), where T (T) = =e log tr(exp(logW° nc nl)) uy 
Lees n — 





(13) 
Let I” be the optimal solution of the dual problem above and let cap,(W) be the density matrix 
obtained when the capping Algorithm 4 is applied to the vector of eigenvalues of W and capping 


parameter d. This lets us express W‘ as: 


(14) 








, exp(logW?—nC’—nI") e ( exp(log W° — nC’) ) 
~ tr(exp(logW° -næ -nr“)) Prk tr(exp(log W? — nC")) 7 


For the analysis we express m' and C’ as online updates: 
Lemma 7 The estimates of mean and covariance can be updated as follows: 


t (qh t+t—Dm' t+ td 1 t—1 
m = a =m = 
Qo +t nN +t 








N+- 1l 


C= SCL] 
N! +t 


(x mY — m!) 





Proof The update rule for m‘ is easy to verify. For the update of C’, we start by expanding the 
expression (12) for C’~!: 


cu =: 7 1 (m°(m°)! — m? (m!) — m! (m)! +m" (m!) ) 





KEO AT mT +x (x1)! ) 
= Eí (x7) 7 +A Taye 1)m'~ (m1) 


t—1 
—(7 1m? + F xa) m1) — m! (N tm? + y x4)" )' +A m? (m Oe. 
q=1 q=1 


By substituting 


1—1 
N im? + v= (M! +t-1)m'! 
q=1 


we get the following: 
= Leon" — (AIH 1m (mT HÄ m m). 
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Algorithm 6 Centered Online PCA Algorithm 
input: 1 < k < n and an initial offset m?, initial density matrix We Bibs C°=0 
fort = 1 to T do 
Perform eigendecomposition W=! = Waw! 
Decompose @ into a convex combination )) ; pjr; of at most n corners rj 
by applying Algorithm 2 with d =n — k 
Draw corner r = r; with probability p; 





Form a matrix corner R = W diag(r) W" 
Form a rank k projection matrix P! = I — (n — k)R 
Receive data instance vector x" 
Incur compression loss 
I] (x = m!) =P e m !)||2 = tr((I — Pp) (x! _ m'—!) (x! zy m'~!)T) 
and expected compression loss (n — k)tr(W’~! (x! — m=!) (xt —m‘~!)") 
Update: 





1 
seed -1 
m = m ene x) (15) 
Nn +t- 1 
Č = De Of =m! (x — m!) (16) 
SF on exp(logW° — nC’) 





tr(exp(logW° -n C')) 
At 
WwW = capp W), 


where cap,,_;(A) applies Algorithm 4 to the vector of eigenvalues of A 
end for 





Now the update for C can be written as: 
C= ci} Ẹ (m! +t— 1m "(mi)" +x (fy? — (q! +t)m'(m')'. 


Substituting the left update for m‘ from the statement of the lemma and simplifying gives the desired 
online update for C’. E 


: All the steps for the Centered Online PCA Algorithm are summarized as Algorithm 6. We already 
reasoned that the capping and decomposition steps are O(n7). The remaining expensive step is 
maintaining the eigendecomposition of the covariance matrix for computing the matrix exponential. 
Using standard rank one update techniques for the eigendecomposition of a symmetric matrix, this 
costs O(n”) per trial (see, e.g., Gu and Eisenstat, 1994). 


6.2 Regret Bound for Centered PCA 


The following theorem proves a regret bound for our Centered Online PCA Algorithm. 
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Theorem 8 For any data sequence x',...,x", initial center value m? such that ||x' —m||2 < 5, any 
density matrix U € By_, and any center vector m, the following bound holds: 


N compy m + A(U,W®) +n '(m—m?)'U(m—m®) 
1 —exp(—n) 


T-1 
miagh ) 








COMP alg < + 1 


where : 
COMP alg = Ł tr(W t(x — mT!) (x — m!) !) 
i=l 


is the overall expected compression loss of the centered online PCA Algorithm 6 and 


compy m = Eel )(x' —m)') 


is the total compression loss of comparison parameters (U ,m). 


Proof There are two main proof methods for the expert setting. The first is based on Bregman 
projections and was used so far in this paper. The second uses the value of the optimization problem 
defining the update as a potential and then shows that the drop of this value (Kivinen and Warmuth, 
1999; Cesa-Bianchiand and Lugosi, 2006) is lower bounded by a constant times the per trial loss 
of the algorithm. Here we use a refinement of the second method that expresses the value of the 
optimization problem in terms of its dual. These variations of the second method were developed 
in the context of boosting (Warmuth et al., 2006; Liao, 2007) and in the conference paper (Kuzmin 
and Warmuth, 2007) where we enhanced the Uncentered Online PCA Algorithm of this paper with 
a kernel. 

For our problem the value® of optimization problem (10) is vt = U'(W',m’) and this equals the 
value of the dual problem L‘(I") where I” maximizes the dual problem (13). 

We want to establish the following key inequality: 








v-v! > qo (1-e N) (te(Wi K =m m=) ") (17) 


Since I” optimizes the dual function L' and I~! is a non-optimal choice, T! > L (I!) and 


therefore 
v — t! = LE) =r) > Pia") Sate) (18) 


Substituting T! and L'~! from (13) into the right hand side of this inequality gives the following: 
EG) ae, Tt! (T!) 
= -n ' logtr(exp(logW° — nC’ -nr !)) + 1! logtr(exp(log W° — nc! — nr ')) 
= —n~' logtr(exp(log W° — nC’ — nr"! — logtr(exp(log W° — nC"! — nr" !))). 





Now we expand C’ and use the covariance matrix update from Lemma 7: 
T (T!) — E (T!) = -7 ' logtr(exp(logW° — nc! — nr! 


~] z 
a hug mT!) (x£ —m'—!)")). 





— logtr(exp(logW? — nC! — nr" !)) 





6. Optimization problem (10) minimizes a convex function subject to linear cone constraint. Since this problem has a 
strictly feasible solution, strong duality is implied by a generalized Slater condition (Boyd and Vandenberghe, 2004). 


2306 


ONLINE PCA 


The first four terms under the matrix exponential form log W’~!, which can be seen from the first 
expression for W~! from (14): 


LA 2) — ETIT!) 





zol 
= = n~ +t-l1 = x 
=- 'Jogtr(exp(logW' ty eer Of — mT!) (x — m T 
Going back to (18) we get the inequality: 
P Je 
jn i+t—1 





> —nlogtr(exp(log W‘' n (x! mi") —mi1)")), 


Nt 
This expression for the drop of the value is essentially the same expression that is normally bounded 
in the proof of online variance minimization algorithm in Warmuth and Kuzmin (2006a). Using 
those techniques (assumption in the theorem implies that that ||’ — m~} || < 1 and all the necessary 
inequalities hold) we get the following inequality: 


n '+r-1 
Ti+ 


(1 —e™)tr(W t(x — m!) (x — m7!) '). 





—y7! logtr(exp(logW”! n (x! — m!) (x — m'=!)")) 
~ 

> qn = tr 1 
nN +t 
W‘! is a density matrix and its eigenvalues are at most 1. And by assumption, norm of x’ — m'~! is 


at most 1. Therefore, the loss tr(W*! (xt —m'~!)(x! — m~!) T) is also at most 1. We split the factor 


in front of the loss as ” = E =] zp upper bounding the loss by 1 for the second part and 
leaving it as is for the first. With this (17) is obtained. 
Note that the trace in the inequality (17) is the loss of the algorithm at trial t. Summation over t 


and telescoping gives us: 





T 
1 
T _,0 -1 = 
v —v > 1-e" (com -ð > 1, 
n ( ) Palg d no! +t 

We consider the left side first: v? is equal to zero, and vf is a minimum of optimization problem 
(10), thus we can make it bigger by substituting arbitrary non-optimal values U and m. Index T 
means the optimization problem is defined with respect to the entire data sequence, therefore the 
loss term becomes the loss of the M On the right side we use the following bound on the 


sum of generalized harmonic series: rL fF l — mal + log (1+; f a4). Overall, we get: 
n 'A(U,W°) +H (m = m)'U(m m?) gg compy m 


T-1 
>n! (1 — e7’) (compas - (1 +log(1 +a) ; 





Moving things over and dividing results in the bound of the theorem. E 
As discussed before, when n7! = n~! = 0, then the algorithm becomes the FL Algorithm. When 


qo! — œ, then m‘ is clamped to m?®, that is, the update for the center (11,15) becomes m’ = m? and 
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is vacuous. Also in that case the term nT)! (m — m?) ' U (m — m?) in the upper bound of Theorem 
8 is infinity unless the comparison center m is m? as well. If m? = 0 in addition to 77! — ©, then 
we call this the uncentered version of Algorithm 6: this version simply ignores step (15) and in (16) 
uses m'—! = 0. Our original Algorithm 5 for uncentered PCA as well as the uncentered version of 
Algorithm 6 have the same regret bound’ of Theorem 6. Recall however that the two algorithms 
were motivated differently: Algorithm 5 trades off divergence to the last parameter with the loss in 
the last trial, whereas Algorithm 6 trades off a divergence to the initial parameter matrix with the 
total loss in all past trials. If all constraints are equality constraints, then the two algorithms are 
the same. However, capping introduces inequality constraints and therefore the two algorithms are 
decidedly not the same. Both algorithm can behave quite differently experimentally (Section 8). 
The difference between the two algorithms will become important in the followup paper (Kuzmin 
and Warmuth, 2007), where we were only able to use a kernel with the algorithm that trades off a 


divergence to the initial parameter matrix with the total loss in all past trials. 


Similarly, when n7t — œ, then W’ is clamped to W? and the algorithm degenerates to a pre- 


viously analyzed algorithm, the Incremental Off-line Algorithm for Gaussian density estimation 
with fixed covariance matrix (Azoury and Warmuth, 2001). For this restricted density estimation 
problem, improved regret bounds were proven for the Forward Algorithm which further shrinks the 
estimate of the mean towards the initial mean. So far we were not able to improve our regret bound 
for uncentered PCA using additional shrinkage towards the initial mean. 

The statement of the theorem requires strong initial knowledge about the center of the data 
sequence we are about to observe: the condition of the theorem says that our data sequence has to 
be contained in a ball of radius 5 around m°. This can be relaxed by using m? = 0 and n~! = 0, 
which corresponds to using standard empirical mean for m’. Now it suffices to assume that data is 
contained in some ball, but we are not required to know where exactly that ball is. The appropriate 
assumption and the change to the bound are detailed in the following corollary. 


Corollary 9 For any data sequence x',...,x" that can be covered by a ball of radius 5 that is, 


||"! —x"||2 < 1 and that also has the bound on the norm of instances ||x'||x < R, any density matrix 
U € B’_, and any center vector m, the total expected loss of centered online PCA Algorithm 6 being 
used with parameters )~' = 0 and m? = 0 is bounded as follows: 


n comMPy m +A(U,W°?) 
COMP alg < 


< i ij + logT + R’, 
—e 





Proof The ball assumption means that the empirical mean m’~! and any element of the data se- 
quence are not too far from each other: ||m’~! — x'||2 < 1. Thus we can still use the Inequality (17), 
for all trials but the first one, where we haven’t seen any data points yet. Summing the drops of the 
value starting from t = 1 we get: 


T 21 
yI —y! > nl —e") [Zea —m) (x —m'~')") =, > r) . 


t=2 2t 





7. The remaining +1 is an artifact of our bound on the harmonic sum. 
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We now add the loss of the first trial into the sum and rearrange terms: 


COMPalg 


err vw—_——~ 


tr(wi! (x = m!) (x = m—1)") 


Mns 


1 


~ 
Il 





>0 <logT ; 
<R 
T 1 T p 
— 1 
< 1 — ) X +tr(W? (x! — m?) (x! — m?) "). 
t=2 


Finally, from the definition of v7 it follows that v? < n~'A(U,W®) + compy m» for any comparator 
U and m, and this gives the bound of the theorem. | 


Tuning ņ as in (3), Corollary 9 gives the following regret bound for our centered online PCA 
Algorithm 6 (when k < 5): 


(expected total compression loss of alg.) - (total comp. loss of best centered k-subspace) 





< y 2(total comp. loss of best centered k-subspace) klog = +klog = +R? +logT. 


6.3 Converting the Online PCA Algorithms to Batch PCA Algorithms 


In the online learning community a number of conversion techniques have been developed that 
allow one to construct a hypothesis with good generalization bounds in the batch setting from the 
hypotheses produced by a run of the online learning algorithm over the given batch of examples. 

For example, using the standard conversion techniques developed for the expert setting based 
on the leave-one-out loss (Cesa-Bianchi et al., 1997), we obtain algorithms with good expected 
regret bounds in the following model: The algorithm is given T — 1 instances drawn from a fixed 
but unknown distribution and produces a k-dimensional subspace based on those instances; it then 
receives a new instance from the same distribution. We can bound the expected loss on the new 
instance (under the usual norm less than one assumption on instances): 


(expected compression loss of alg.) - (expected compression loss best k-space) 








== if) ( y (expected compression loss of best k-subspace)klog ž i klog% ) l 
T T 
The expected loss of the algorithm is taken as expectation over both the internal randomization of 
the algorithm and fixed distribution over the instances. The expected loss of the best subspace just 
averages over the distribution of the instances. The best subspace itself will be determined by the 
covariance matrix of this distribution. 

Additionally, there also exist very general conversion methods that allow us to state bounds 
that say that the generalization error will be big with small probability (Cesa-Bianchi and Gentile, 
2005). These bounds are more complicated and therefore we don’t state them here. The conversion 
algorithms however, are pretty simple: for example, one can use the average density matrix of 
all density matrices produced by the online algorithm while doing one pass through the batch of 
instances. Perhaps surprisingly, the generalization bounds for batch PCA obtained via the online- 
to-batch conversions are competitive with the best bounds for batch PCA that we are aware of 
Shawe-Taylor et al. (2005). 
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7. Lower Bounds 


We first prove some lower bounds for the simplest online algorithm that just predicts with the model 
that has incurred minimum loss so far (the Follow the Leader (FL) Algorithm). After that we give a 
lower bound for uncentered PCA that shows that the algorithm presented in this paper is optimal in 
a very strong sense. 

Our first lower bound is in the standard expert setting. We assume that there is a deterministic 
tie-breaking rule, because by adding small perturbations, ties can always be avoided in this con- 
struction. It is easy to see that the following adversary strategy forces FL to have loss n times larger 
than the loss of the best expert chosen in hindsight: in each trial have the expert chosen by FL incur 
one unit of loss. Note that the algorithm incurs loss one in each trial, whereas the loss of the best 
expert is [| after T trials. We conclude that the loss of FL can be by a factor of n larger than the 
loss of the best expert. 

We next show that for the set expert case, FL can be forced to have loss at least 4 times the loss 
of the best set of size d. In this case FL chooses a set of size d of minimum loss and the adversary 
forces the lowest loss expert in the set chosen by FL to incur one unit of loss. The algorithm again 
incurs loss one in each trial, but the loss of the best set lies in the range d [|4|,{4]]. Thus in this 
case the loss of FL can be by a factor of 4 larger than the loss of the best set of size d. 

When rephrased i.t.o. compression losses, FL picks a set of size n — d whose complementary 
set of size d has minimum compression loss. We just showed that the total compression loss of FL 
can be at least 4 times the compression loss of the best subset of size n — d. 

We can lift the above lower bound for sets to the case of uncentered PCA. Now d =n — k and k 
is the rank of the subspace we want to compress onto. To simplify the argument, we let the first n 
instances be small multiples of the standard basis vectors. More precisely, x, = tee’, for 1 <t < n 
and small real £. These instances cause the uncentered data covariance matrix 1", xx; to be a 
diagonal matrix. Also, if € is small enough then the loss in the first n trials is negligible. From now 
on FL always chooses a unique set of d = n—k standard basis vectors of minimum loss and the 
adversary chooses a standard basis vector with the lowest loss in the set as the next instance. So the 
lower bound argument essentially reduces to the set case, and FL can be forced to have compression 
loss —*_ times the loss of the compression loss of the best k dimensional subspace. 

So far we have shown that our online algorithms are better than the simplistic FL Algorithm 
since their compression losses are at most one times the loss of the best plus essentially a square 
root term. We now show that the constant in front of the square root term is rather tight as well. For 
the expert setting (d = 1) this was already done: 


Theorem 10 (Theorem C.3. of the journal version Helmbold and Warmuth 2008 of the conference 
paper Helmbold and Warmuth 2007.) For all € > Q there exists ng such that for any number of 
experts n > nę, there exists a Te», where for any number of trials T > Tz the following holds for 
any algorithm in the expert setting: there is a sequence of T trials with n experts for which the loss 
of the best expert is at most T /2 and the regret of the algorithm is at least (1 —€),/(T /2) logn. 


In the expert model used in this paper, we follow Freund and Schapire (1997) and assume that 
the losses of the experts in each trial are specified by a loss vector in [0,1]%. There is a related 
model (studied earlier), where the experts produce predictions in each trial. After receiving those 
predictions the algorithm produces its own prediction and receives a label. The loss of the experts 
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and algorithm is the absolute value of the difference between the predictions and the label, respec- 
tively (Littlestone and Warmuth, 1994). The above theorem actually holds for this model of online 
learning with the absolute loss (when all predictions are in [0,1] and the labels are in {0,1}), and 
the model used in this paper may be seen as the special case where the prediction of the algorithm is 
formed by simply averaging the predictions of the experts (Freund and Schapire, 1997). Therefore 
any lower bound for the described expert model with absolute loss immediately holds for the expert 
model where the loss is specified by a loss vector. 


Note that the regret bounds for the Hedge Algorithm discussed at the end of Section 3 have 
an additional factor of 2 in the square root term. By choosing a prediction function other than the 
weighted average, the factor of 2 can be avoided in the expert model with the absolute loss, and 
the upper and lower bounds for the regret have the same constant in front of the square root term 
(provided that N and T are large enough) (Cesa-Bianchi et al., 1997; Cesa-Bianchiand and Lugosi, 
2006). 


The above lower bound theorem immediately generalizes to the case of set experts. Partition 
the experts into d blocks of size 7 (assume d divides n). For any algorithm and block, construct 
a sequence of length T as before. During the sequence for one block, the experts for all the other 
blocks have loss zero. The loss of the best set of size d on the whole sequence of length Td is at 
most Td/2 and the regret, that is, the loss of the algorithm on the sequence minus the loss of the 


best set of size d, is lower bounded by 


T n dT n 
1 l = (1 l : 
(1-e)dy/ 5 log% = (1—8) diog” 


Rewritten in terms of compression losses for compression sets of size k (i.e., d = n — k), the lower 
bound on the compression loss regret becomes 











(1-8) [compression loss of best k-subset (n — k) log i i (19) 
n — 
Note that the upper bound (4) obtained by our algorithm for learning as well as the best subset is 
essentially a factor of \/2 larger than this lower bound. 


Finally, we lift the above lower bound for subsets to a lower bound for uncentered PCA. In the 
setup for uncentered PCA, the instance matrix at trial t is S’ = xt (xt)! , where x’ has 2-norm at most 
1. For the lower bound we need the instance matrix S’ to be an arbitrary symmetric matrix with 
eigenvalues in [0,1]. As discussed before Section 5.1, the upper bound for uncentered PCA still 
holds for these more general instance matrices. 


To lift the lower bound for subsets to uncentered PCA, we simply replace the loss vector 4 by 
the instance matrix S’ = diag(¢’). At trial t, the PCA algorithm uses the density matrix W’~! and 
incurs expected loss tr(W‘! diag(¢’)) = diag(W*!) - 2’. Note that the diagonal vector diag(W*!) is 
a probability vector. Thus the PCA algorithm doesn’t have any advantage from using non-diagonal 
density matrices and the lower bound reduces to the set case. We conclude that the lower bound 
(19) also holds for the compression loss regret of uncentered PCA algorithms when the instance 
matrices are allowed to be symmetric matrices with eigenvalues in [0,1]. Again the corresponding 
upper bound (7) is essentially a factor of 2 larger. 
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Figure 3: The data sequence used for the Figure 4: The blue/solid curve is the total loss of uncen- 
first experiment switches be- tered online PCA Algorithm 5 for the data se- 
tween three different subspaces. quence described in Figure 3 (with n = 20,k = 2 
It is split into three segments. and 7 = 1). The algorithm uses internal ran- 
Within each segment, the data domization for choosing a subspace and there- 
is drawn from a different 20- fore the curve is actually the average total loss 
dimensional Gaussian with a over 50 runs for the same data sequence. The 
rank 2 covariance matrix. We error bars (one standard deviation) indicate the 
plot the first three coordinates variance of the algorithm. The black/dash-dotted 
of each data point. Different curve plots the same for the uncentered version 
colors/symbols denote the data of Algorithm 6 (again n = 1). The visible bumps 
points that came from the three in the curves correspond to places in the data se- 
different subspaces. quence where it shifts from one subspace to an- 


other. The red/dashed curve is the total loss of 
the best projection matrix determined in hind- 
sight (i.e., loss of batch uncentered PCA). The 
green/dotted curve is the total loss of the Follow 
the Leader Algorithm. 


8. Simple Experiments 


The regret bounds we prove for our online PCA algorithms hold for arbitrary sequences of instances. 
In other words, they hold even if the instances are produced by an adversary which aims to make 
the algorithm have large regret. In many cases, natural data does not have a strong adversarial 
nature and even the simple Follow the Leader Algorithm might have small regret against the best 
subspace chosen in hindsight. However, natural data commonly shifts with time. It is on such time- 
changing data sets that online algorithms have an advantage. In this section we present some simple 
experiments that bring out the ability of our online algorithms to adapt to data that shifts with time. 


For our first experiment we constructed a simple synthetic data set of time-changing nature. 
The data sequence is divided into three equal intervals, of 500 points each. Within each interval 
data points are picked at random from a multivariate Gaussian distribution on R? with zero mean. 
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Figure 5: Behavior of the Uncentered Online PCA Algorithm 5 when data shifts from one subspace to 
another. First shift for one of the runs in Figure 4 is shown. We show the projection matrices that 
have the highest probability of being picked by the algorithm in a given trial. Since k = 2, each 
such matrix P’ can be seen as a 2-dimensional ellipse in R79: the ellipse is formed by points P'x 
for all ||x||2 = 1. We plot the first three coordinates of this ellipse. The transition sequence starts 
with the algorithm focused on the optimal projection matrix for the first subset of data and ends 
with essentially the optimal matrix for the second subset. The depicted transition takes about 60 
trials and only every 5th trial is plotted. 


The covariance matrices for the Gaussians were picked at random but constrained to rank 2, thus 
ensuring that the generated points lie in some 2-dimensional subspace. The generated points with 
norm bigger than one were normalized to 1. The data set is graphically represented in Figure 3, 
which plots the first three dimensions of each one of the data points. Different colors/symbols 
indicate data points that came from the three different subspaces. 


In Figure 4 we plot the total compression loss for some of the algorithms introduced in this 
paper. For the sake of simplicity we restrict ourselves to uncentered PCA. Here the data dimension 
n is 20 and the subspace dimension k is 2. We plot the total loss of the following algorithms as a 
function of the trial number: the FL Algorithm, the original uncentered PCA Algorithm 5 and the 
uncentered version of Algorithm 6. For the latter two algorithms we need to select a learning rate. 
One possibility is to choose the learning rate that optimizes our upper bound on the regret of the 
algorithms (3). Since the bound is the same for both algorithms this choice of 7 is also the same: 
the choice depends on an upper bound on the compression loss of the batch algorithm. Plugging 
in the actual compression loss of the batch algorithm gives n = 0.12. In practice, heuristics can be 
used to tune n. For the experiment of Figure 4 we simply chose n = 1. Recall that our online PCA 
algorithms decompose their density matrix parameter W* into a convex combination of projection 
matrices using the deterministic Algorithm 2. However, the PCA algorithms then randomly select 
one of the k dimensional projection matrices in the convex combination with probability equal to 
its coefficient. This introduces randomness into the execution of the algorithm even when run on a 
fixed data sequence. We run the algorithms 50 times and plot the average total loss as a function 
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of ¢ for the fixed data sequence depicted in Figure 3. We also indicate the variance of this total 
loss with error bars of one standard deviation. Note again that the average and variance is w.r.t. 
internal randomization of the algorithm. The bumps in the loss curves of Figure 4 correspond to the 
places in the data sequence where it shifts from one subspace to another. When the loss curve of an 
algorithm flattens out then the algorithm has learned the correct subspace for the current segment. 
For example, Figure 5 depicts how the density matrix of the uncentered PCA Algorithm 5 (N = 1) 
transitions around a segment boundary. 

The FL Algorithm (which coincides with the uncentered version of Algorithm 10 when n — œ) 
learns the first segment really quickly, but it does not recover during the later segments. The original 
uncentered PCA Algorithm 5 (with n = 1) recovers in each segment, whereas the uncentered version 
of Algorithm 6 (with n = 1) does not recover as quickly and has higher total loss on this data 
sequence. Recall that both online algorithms for uncentered PCA have the same regret bound against 
the best fixed offline comparator. 

In Figure 4 we also plot the compression loss of the best subspace selected in hindsight by run- 
ning batch uncentered PCA (dashed/red line). The data set we generated is not well approximated 
by a single 2-dimensional subspace, since it consists of three different 2-dimensional subspaces. As 
a result, the overall loss of the fixed subspace is higher than the loss of both of our uncentered online 
PCA algorithms that to a varying extent were able to switch between the different 2-dimensional 
subspaces. 

There are many heuristics for detecting switches of the data. For example one could simply 
check for a performance loss of any algorithm in a suitably chosen final segment. However, such 
algorithms are often unwieldy and are hard to tune when the data in difference segments are not 
drastically different. Note that for the Uncentered Online PCA Algorithm presented in this paper 
we only have provable good regret bounds against the best fixed subspace chosen offline (based on 
the entire data sequence). (In Figure 4 both of our online algorithms beat the offline comparator 
and thus have negative regret against this simple comparator.) Ideally we would like measure regret 
against stronger offline comparators that can exploit the shifting nature of the data. In the expert 
setting there is a long line of research where the offline algorithm is allowed to partition the data 
sequence into s segments and pick the best subspace for each of the s segments. There are simple 
modifications of the online algorithm in the experts setting that have good regret bounds against the 
best partition of size s (see, e.g., Herbster and Warmuth, 1998). Naturally, the regret bounds grow 
with the number of segments s. The modifications that are needed to achieve these regret bounds are 
simple: insert an additional update step at the end of each trial that mixes a little bit of the uniform 
distribution into the weight vector. In the context of uncentered PCA this amounts to adding the 
following update at the end of each trial: 


W=(1 _ow+at. 
n 


We claim that it is easy to again lift the techniques developed in the expert setting to PCA and prove 
the analogous regret bounds against the best partition. 

The following subtle modification of the mixing rule leads to algorithms with even more in- 
teresting properties. If the algorithm mixes in a little bit of the average of all past weight vectors 
instead of the uniform vector, then it is able to switch quickly to experts that have been good at 
some time in the past. This has been called the “long-term memory” property of such algorithms. 
In the context of uncentered PCA, this amounts the following additional update step for the density 
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Figure 6: Long term memory effect when mixing in past average is added to the uncentered online PCA 
Algorithm 5. The data sequence is comprised of several segments, each one of two hundred 
randomly sampled images of the same person but with different facial expressions and lighting 
conditions. These segments are indicated with dotted lines and are labeled with the face of the 
person that was used to generate the segment. The plot depicts the regret of the uncentered online 
PCA Algorithm 5 (qn = 1) with added mixing update (20) (a = .001). The regret is w.r.t. the 
indicated partition of size six. 


matrix: 
Ka Wa 
W‘ = (1—a)W! + a 2 £ 


(20) 


We performed another experiment that demonstrates this “long-term memory” effect by adding 
update (20) to the uncentered online PCA Algorithm 5: For this experiment we used face image data 
from Yale-B data set. A segmented data sequence was constructed by sampling the face images, 
where each segment contained images of the same person, but with different facial expressions, 
lighting conditions, etc. Figure 6 plots the regret of our uncentered online PCA Algorithm 5 against 
the best partition of size 6 chosen offline (n = 1,a= .001). In the picture the segment boundaries are 
indicated with dotted lines and the face below each segment shows the person who’s pictures were 
sampled during that section. Note that our online algorithm doesn’t have knowledge of segment 
boundaries. The first segment shows the typical behavior of online algorithms: the regret grows 
initially, but then flattens out once the algorithm converges to the best solution for that segment. 
Thus the “bump” in the regret curve is due to the fact that the algorithm has to learn a new segment 
of pictures from a different person. By comparing the bumps of different segments we see that they 
are significantly smaller in segments where the algorithm encounters pictures from a person that it 
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has previously seen: the algorithm “remembers” subspaces that were good in the past. These small 
bumps can be seen in segments 3,5 and 6 in our data set. For additional plots of long-term memory 
effects in the expert setting see Bousquet and Warmuth (2002). Our experiments are preliminary and 
we did not formally prove regret bounds for shifting comparators in PCA setting, but such bounds 
are straightforwardly obtained using the methodology from this paper and Bousquet and Warmuth 
(2002). 


9. Conclusions 


We developed a new set of techniques for learning as well as a low dimensional subspace. We first 
developed the algorithms in the expert case and then lifted these algorithms and bounds to the matrix 
case as essentially done in Tsuda et al. (2005); Warmuth and Kuzmin (2006a). The new insight is 
to represent our uncertainty over the subspace as a density matrix with capped eigenvalues. We 
show how to decompose such a matrix into a mixture of n subspaces of the desired rank. In the case 
of PCA the seemingly quadratic compression loss can be rewritten as a linear loss of our matrix 
parameter. Therefore a random subspace chosen from the mixture has the same expected loss as the 
loss of the parameter matrix. 

Similar techniques (albeit not capping) have been used recently for learning permutations (Helm- 
bold and Warmuth, 2008): Instead of maintaining a probability vector over the n! permutations, the 
parameter used in that paper is a convex combination of permutation matrices which is an n x n 
doubly stochastic matrix. This matrix can be efficiently decomposed into a mixture of O(n?) per- 
mutation matrices. If the loss is linear, then a random permutation chosen from the mixture has the 
same expected loss as the loss of the doubly stochastic parameter matrix. 

When the loss is convex and not linear (as for example the quadratic loss in the generalization 
of linear regression to matrix parameters Tsuda et al. 2005), then our new capping trick can still be 
applied to the density matrix parameter. In this case, the online loss of the capped density matrix can 
be shown to be close to the loss of the best low dimensional subspace. However the quadratic loss 
of the capped density matrix (which is a convex combination of subspaces) can be much smaller 
than the convex combination of the losses of the subspaces. The bounds of that paper only hold for 
the smaller loss of the capped density matrix and not for the expected loss of the subspace sampled 
from the convex combination represented by the capped density matrix. 

Our new capping technique is interesting in its own right. Whereas the vanilla multiplicative 
update on n experts with exponential factors is prone to be unstable because it tends to go into a 
corner of the simplex (by putting all weight on the currently best expert), the technique of capping 
the weight at i keeps a set of at least d experts alive. In some sense the capping has the same effect 
as a “super predator” has on a biological system: such a predator specializes on the most frequent 
species of prey, preventing the dominance of any particular species and thus preserving variety. See 
Warmuth (2007a) for a discussion of the relationships of our updates to biological systems. 

Following Kivinen and Warmuth (1997); Helmbold et al. (1999), we motivate our updates by 
trading off a parameter divergence against a loss divergence. In our case, the parameter divergence 
is always a capped relative entropy and the loss divergence is simply linear. It would be interesting 
to apply the capping trick to the logistic loss used in logistic regression. The logistic loss can be 
seen as a relative entropy between the desired probabilities of the outcomes and the predicted prob- 
abilities of the outcomes obtained by applying a sigmoid function to the linear activations (Kivinen 
and Warmuth, 2001). For “capped logistic regression’, linear constraints would be added to the op- 
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timization problem defining the updates that cap the predicted probabilities of the outcomes. This 
would lead to versions of logistic regression where the loss favors a small set of outcomes instead 
of a single outcome. 

The algorithms of this paper belong to the family of multiplicative updates, that is, the parame- 
ters are updated by multiplicative factors of exponentials. In the matrix case the updates make use 
of the matrix log and matrix exponential. There is a second family of updates that does additive 
parameter updates. In particular, there are additive online updates for PCA (Crammer, 2006). The 
latter update family has the key advantage that they can be easily used with kernels. However, in 
a recent conference paper (Kuzmin and Warmuth, 2007) we also were able to give a special case 
where the multiplicative updates could be enhanced with a kernel. In this case the instance ma- 
trices are outer products x(x)" and are replaced by 6(x)‘(o(x’))'. In particular, the online PCA 
algorithms of this paper can be “kernelized” this way. 

In PCA the instance matrices are symmetric matrices x'(x')' and we seek a low rank symmetric 
subspace that approximates the instances well. In a recent conference paper we showed in a slightly 
different context how to generalize our methods to the case of “asymmetric” matrices of arbitrary 
shape. Using those techniques, the online PCA algorithms of this paper can be generalized to the 
asymmetric case: now the instance matrices are rank one n x m matrices of the form x’ (x")', where 
x and x are vectors of dimension n and m, respectively. In the asymmetric case, the underlying 
decomposition is the SVD decomposition instead of the eigendecomposition. 

There are two technical open problems that arise in this paper. We gave a number of dynamic 
programming algorithms, such as the one given for the set expert problem. If the loss range for the 
individual experts is [0,1] then the loss range for the set experts is [0,d] when d is the size of the 
set. The straightforward application of results for the Hedge Algorithm leads to extra factors of d 
in the regret bounds for set experts. We avoided these factors using our capping trick. However, 
the question is whether the same bounds (without the d factors) hold for the dynamic programming 
algorithm as well. There is a different fundamental algorithmic technique for dealing with structural 
domains: the Follow the Perturbed Leader Algorithm (Kalai and Vempala, 2005). The second open 
problem is how to adapt this algorithm to online PCA and prove bounds for it that don’t contain the 
extra d factors. 

Finally, independent of what theoretical progress might have been achieved in this paper, we still 
have to find a convincing experimental setting where our new online PCA algorithms are indispens- 
able. The update time of our algorithms is O(n”) and further time improvements via approximations 
or lazy evaluation might be needed to make the algorithms widely applicable. 
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